Commit Graph

114 Commits

Author SHA1 Message Date
Seth Hoenig
05937ab75b exec2: add client support for unveil filesystem isolation mode (#20115)
* exec2: add client support for unveil filesystem isolation mode

This PR adds support for a new filesystem isolation mode, "Unveil". The
mode introduces a "alloc_mounts" directory where tasks have user-owned
directory structure which are bind mounts into the real alloc directory
structure. This enables a task driver to use landlock (and maybe the
real unveil on openbsd one day) to isolate a task to the task owned
directory structure, providing sandboxing.

* actually create alloc-mounts-dir directory

* fix doc strings about alloc mount dir paths
2024-03-13 08:24:17 -05:00
Marvin Chin
d75293d2ab Add OOM detection for exec driver (#19563)
* Add OomKilled field to executor proto format

* Teach linux executor to detect and report OOMs

* Teach exec driver to propagate OOMKill information

* Fix data race

* use tail /dev/zero to create oom condition

* use new test framework

* minor tweaks to executor test

* add cl entry

* remove type conversion

---------

Co-authored-by: Marvin Chin <marvinchin@users.noreply.github.com>
Co-authored-by: Seth Hoenig <shoenig@duck.com>
2024-01-03 09:50:27 -06:00
Seth Hoenig
591394fb62 drivers: plumb hardware topology via grpc into drivers (#18504)
* drivers: plumb hardware topology via grpc into drivers

This PR swaps out the temporary use of detecting system hardware manually
in each driver for using the Client's detected topology by plumbing the
data over gRPC. This ensures that Client configuration is taken to account
consistently in all references to system topology.

* cr: use enum instead of bool for core grade

* cr: fix test slit tables to be possible
2023-09-18 08:58:07 -05:00
Seth Hoenig
2e1974a574 client: refactor cpuset partitioning (#18371)
* client: refactor cpuset partitioning

This PR updates the way Nomad client manages the split between tasks
that make use of resources.cpus vs. resources.cores.

Previously, each task was explicitly assigned which CPU cores they were
able to run on. Every time a task was started or destroyed, all other
tasks' cpusets would need to be updated. This was inefficient and would
crush the Linux kernel when a client would try to run ~400 or so tasks.

Now, we make use of cgroup heirarchy and cpuset inheritence to efficiently
manage cpusets.

* cr: tweaks for feedback
2023-09-12 09:11:11 -05:00
hashicorp-copywrite[bot]
2d35e32ec9 Update copyright file headers to BUSL-1.1 2023-08-10 17:27:15 -05:00
Seth Hoenig
a4cc76bd3e numa: enable numa topology detection (#18146)
* client: refactor cgroups management in client

* client: fingerprint numa topology

* client: plumb numa and cgroups changes to drivers

* client: cleanup task resource accounting

* client: numa client and config plumbing

* lib: add a stack implementation

* tools: remove ec2info tool

* plugins: fixup testing for cgroups / numa changes

* build: update makefile and package tests and cl
2023-08-10 17:05:30 -05:00
hashicorp-copywrite[bot]
f005448366 [COMPLIANCE] Add Copyright and License Headers 2023-04-10 15:36:59 +00:00
Lance Haig
3160c76209 deps: Update ioutil library references to os and io respectively for drivers package (#16331)
* Update ioutil library references to os and io respectively for drivers package

No user facing changes so I assume no change log is required

* Fix failing tests
2023-03-08 10:31:09 -06:00
Farbod Ahmadian
fbd0dcbe9b tests: add functionality to skip a test if it's not running in CI and not with root user (#16222) 2023-03-02 13:38:27 -05:00
James Rasell
25e7c2ffa4 chore: remove use of "err" a log line context key for errors. (#14433)
Log lines which include an error should use the full term "error"
as the context key. This provides consistency across the codebase
and avoids a Go style which operators might not be aware of.
2022-09-01 15:06:10 +02:00
Piotr Kazmierczak
c4be2c6078 cleanup: replace TypeToPtr helper methods with pointer.Of (#14151)
Bumping compile time requirement to go 1.18 allows us to simplify our pointer helper methods.
2022-08-17 18:26:34 +02:00
Eng Zer Jun
fca4ee8e05 test: use T.TempDir to create temporary test directory (#12853)
* test: use `T.TempDir` to create temporary test directory

This commit replaces `ioutil.TempDir` with `t.TempDir` in tests. The
directory created by `t.TempDir` is automatically removed when the test
and all its subtests complete.

Prior to this commit, temporary directory created using `ioutil.TempDir`
needs to be removed manually by calling `os.RemoveAll`, which is omitted
in some tests. The error handling boilerplate e.g.
	defer func() {
		if err := os.RemoveAll(dir); err != nil {
			t.Fatal(err)
		}
	}
is also tedious, but `t.TempDir` handles this for us nicely.

Reference: https://pkg.go.dev/testing#T.TempDir
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>

* test: fix TestLogmon_Start_restart on Windows

Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>

* test: fix failing TestConsul_Integration

t.TempDir fails to perform the cleanup properly because the folder is
still in use

testing.go:967: TempDir RemoveAll cleanup: unlinkat /tmp/TestConsul_Integration2837567823/002/191a6f1a-5371-cf7c-da38-220fe85d10e5/web/secrets: device or resource busy

Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
2022-05-12 11:42:40 -04:00
Seth Hoenig
37ffd2ffa2 cgroups: make sure cgroup still exists after task restart
This PR modifies raw_exec and exec to ensure the cgroup for a task
they are driving still exists during a task restart. These drivers
have the same bug but with different root cause.

For raw_exec, we were removing the cgroup in 2 places - the cpuset
manager, and in the unix containment implementation (the thing that
uses freezer cgroup to clean house). During a task restart, the
containment would remove the cgroup, and when the task runner hooks
went to start again would block on waiting for the cgroup to exist,
which will never happen, because it gets created by the cpuset manager
which only runs as an alloc pre-start hook. The fix here is to simply
not delete the cgroup in the containment implementation; killing the
PIDs is enough. The removal happens in the cpuset manager later anyway.

For exec, it's the same idea, except DestroyTask is called on task
failure, which in turn calls into libcontainer, which in turn deletes
the cgroup. In this case we do not have control over the deletion of
the cgroup, so instead we hack the cgroup back into life after the
call to DestroyTask.

All of this only applies to cgroups v2.
2022-05-05 09:51:03 -05:00
Tim Gross
3671ea6a8f remove pre-0.9 driver code and related E2E test (#12791)
This test exercises upgrades between 0.8 and Nomad versions greater
than 0.9. We have not supported 0.8.x in a very long time and in any
case the test has been marked to skip because the downloader doesn't
work.
2022-04-27 09:53:37 -04:00
Seth Hoenig
c7836c6c8a exec: fix exec handler test
Fixup this test to handle cgroups v2, as well as the :misc: cgroup
2022-04-06 12:11:37 -05:00
Seth Hoenig
be7ec8de3e raw_exec: make raw exec driver work with cgroups v2
This PR adds support for the raw_exec driver on systems with only cgroups v2.

The raw exec driver is able to use cgroups to manage processes. This happens
only on Linux, when exec_driver is enabled, and the no_cgroups option is not
set. The driver uses the freezer controller to freeze processes of a task,
issue a sigkill, then unfreeze. Previously the implementation assumed cgroups
v1, and now it also supports cgroups v2.

There is a bit of refactoring in this PR, but the fundamental design remains
the same.

Closes #12351 #12348
2022-04-04 16:11:38 -05:00
Seth Hoenig
5da1a31e94 client: enable support for cgroups v2
This PR introduces support for using Nomad on systems with cgroups v2 [1]
enabled as the cgroups controller mounted on /sys/fs/cgroups. Newer Linux
distros like Ubuntu 21.10 are shipping with cgroups v2 only, causing problems
for Nomad users.

Nomad mostly "just works" with cgroups v2 due to the indirection via libcontainer,
but not so for managing cpuset cgroups. Before, Nomad has been making use of
a feature in v1 where a PID could be a member of more than one cgroup. In v2
this is no longer possible, and so the logic around computing cpuset values
must be modified. When Nomad detects v2, it manages cpuset values in-process,
rather than making use of cgroup heirarchy inheritence via shared/reserved
parents.

Nomad will only activate the v2 logic when it detects cgroups2 is mounted at
/sys/fs/cgroups. This means on systems running in hybrid mode with cgroups2
mounted at /sys/fs/cgroups/unified (as is typical) Nomad will continue to
use the v1 logic, and should operate as before. Systems that do not support
cgroups v2 are also not affected.

When v2 is activated, Nomad will create a parent called nomad.slice (unless
otherwise configured in Client conifg), and create cgroups for tasks using
naming convention <allocID>-<task>.scope. These follow the naming convention
set by systemd and also used by Docker when cgroups v2 is detected.

Client nodes now export a new fingerprint attribute, unique.cgroups.version
which will be set to 'v1' or 'v2' to indicate the cgroups regime in use by
Nomad.

The new cpuset management strategy fixes #11705, where docker tasks that
spawned processes on startup would "leak". In cgroups v2, the PIDs are
started in the cgroup they will always live in, and thus the cause of
the leak is eliminated.

[1] https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html

Closes #11289
Fixes #11705 #11773 #11933
2022-03-23 11:35:27 -05:00
Seth Hoenig
b242957990 ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
Seth Hoenig
8492c6576e build: upgrade and speedup circleci configuration
This PR upgrades our CI images and fixes some affected tests.

- upgrade go-machine-image to premade latest ubuntu LTS (ubuntu-2004:202111-02)

- eliminate go-machine-recent-image (no longer necessary)

- manage GOPATH in GNUMakefile (see https://discuss.circleci.com/t/gopath-is-set-to-multiple-directories/7174)

- fix tcp dial error check (message seems to be OS specific)

- spot check values measured instead of specifically 'RSS' (rss no longer reported in cgroups v2)

- use safe MkdirTemp for generating tmpfiles

NOT applied: (too flakey)

- eliminate setting GOMAXPROCS=1 (build tools were also affected by this setting)

- upgrade resource type for all imanges to large (2C -> 4C)
2022-01-24 08:28:14 -06:00
Mahmood Ali
6c414cd5f9 gofmt all the files
mostly to handle build directives in 1.17.
2021-10-01 10:14:28 -04:00
Seth Hoenig
595cef8136 drivers/exec: pass capabilities through executor RPC
Add capabilities to the LaunchRequest proto so that the
capabilities set actually gets plumbed all the way through
to task launch.
2021-05-17 12:37:40 -06:00
Seth Hoenig
c34beb48b1 drivers/docker: reuse capabilities plumbing in docker driver
This changeset does not introduce any functional change for the
docker driver, but rather cleans up the implementation around
computing configured capabilities by re-using code written for
the exec/java task drivers.
2021-05-17 12:37:40 -06:00
Seth Hoenig
9bb4b8fa04 drivers/java: enable setting allow_caps on java driver
Enable setting allow_caps on the java task driver plugin, along
with the associated cap_add and cap_drop options in java task
configuration.
2021-05-17 12:37:40 -06:00
Seth Hoenig
191144c3bf drivers/exec: enable setting allow_caps on exec driver
This PR enables setting allow_caps on the exec driver
plugin configuration, as well as cap_add and cap_drop in
exec task configuration. These options replicate the
functionality already present in the docker task driver.

Important: this change also reduces the default set of
capabilities enabled by the exec driver to match the
default set enabled by the docker driver. Until v1.0.5
the exec task driver would enable all capabilities supported
by the operating system. v1.0.5 removed NET_RAW from that
list of default capabilities, but left may others which
could potentially also be leveraged by compromised tasks.

Important: the "root" user is still special cased when
used with the exec driver. Older versions of Nomad enabled
enabled all capabilities supported by the operating system
for tasks set with the root user. To maintain compatibility
with existing clusters we continue supporting this "feature",
however we maintain support for the legacy set of capabilities
rather than enabling all capabilities now supported on modern
operating systems.
2021-05-17 12:37:40 -06:00
Nick Ethier
38bc1b1a31 client/fingerprint: move existing cgroup concerns to cgutil 2021-04-13 13:28:36 -04:00
Seth Hoenig
836ee9e4a2 drivers/exec+java: Add task configuration to restore previous PID/IPC isolation behavior
This PR adds pid_mode and ipc_mode options to the exec and java task
driver config options. By default these will defer to the default_pid_mode
and default_ipc_mode agent plugin options created in #9969. Setting
these values to "host" mode disables isolation for the task. Doing so
is not recommended, but may be necessary to support legacy job configurations.

Closes #9970
2021-02-08 14:26:35 -06:00
Seth Hoenig
6dd5de4b69 docs: fixup comments, var names 2021-02-08 10:58:44 -06:00
Seth Hoenig
b682371a22 drivers/exec+java: Add configuration to restore previous PID/IPC namespace behavior.
This PR adds default_pid_mode and default_ipc_mode options to the exec and java
task drivers. By default these will default to "private" mode, enabling PID and
IPC isolation for tasks. Setting them to "host" mode disables isolation. Doing
so is not recommended, but may be necessary to support legacy job configurations.

Closes #9969
2021-02-05 15:52:11 -06:00
vagrant
c1910544be attempting to fix flaky tests caused by pid isolation 2021-01-28 12:03:20 +00:00
Chris Baker
6a067a33ff modify exec driver test TestExecDriver_StartWaitStop in light of the fact that signaling sleep with SIGINT doesn't work if it's PID1 2021-01-28 12:03:19 +00:00
Chris Baker
611abc33a5 modify exec driver test TestExecDriver_DestroyKills all in light of the fact that PID namespacing means that the kernel does this now 2021-01-28 12:03:19 +00:00
Mahmood Ali
7b9d6a3552 tests: ignore empty cgroup
My latest Vagrant box contains an empty cgroup name that isn't used for
isolation:

```
$ cat /proc/self/cgroup  | grep ::
0::/user.slice/user-1000.slice/session-17.scope
```
2020-10-01 10:23:13 -04:00
Nick Ethier
af6aee3b8a fix test failures from rebase 2020-06-18 11:05:32 -07:00
Nick Ethier
e9ff8a8daa Task DNS Options (#7661)
Co-Authored-By: Tim Gross <tgross@hashicorp.com>
Co-Authored-By: Seth Hoenig <shoenig@hashicorp.com>
2020-06-18 11:01:31 -07:00
Mahmood Ali
d6c75e301e cleanup driver eventor goroutines
This fixes few cases where driver eventor goroutines are leaked during
normal operations, but especially so in tests.

This change makes few modifications:

First, it switches drivers to use `Context`s to manage shutdown events.
Previously, it relied on callers invoking `.Shutdown()` function that is
specific to internal drivers only and require casting.  Using `Contexts`
provide a consistent idiomatic way to manage lifecycle for both internal
and external drivers.

Also, I discovered few places where we don't clean up a temporary driver
instance in the plugin catalog code, where we dispense a driver to
inspect and validate the schema config without properly cleaning it up.
2020-05-26 11:04:04 -04:00
Tim Gross
8860b72bc3 volumes: return better error messages for unsupported task drivers (#8030)
When an allocation runs for a task driver that can't support volume mounts,
the mounting will fail in a way that can be hard to understand. With host
volumes this usually means failing silently, whereas with CSI the operator
gets inscrutable internals exposed in the `nomad alloc status`.

This changeset adds a MountConfig field to the task driver Capabilities
response. We validate this when the `csi_hook` or `volume_hook` fires and
return a user-friendly error.

Note that we don't currently have a way to get driver capabilities up to the
server, except through attributes. Validating this when the user initially
submits the jobspec would be even better than what we're doing here (and could
be useful for all our other capabilities), but that's out of scope for this
changeset.

Also note that the MountConfig enum starts with "supports all" in order to
support community plugins in a backwards compatible way, rather than cutting
them off from volume mounting unexpectedly.
2020-05-21 09:18:02 -04:00
Thomas Lefebvre
5a017acd0b client: support no_pivot_root in exec driver configuration 2020-02-18 09:27:16 -08:00
Mahmood Ali
ac9547e6b2 drivers: always initialize taskHandle.logger
Looks like the RecoverTask doesn't set taskHandle.logger field causing
a panic when the handle attempts to log (e.g. when Shutdown or Signaling
fails).
2019-11-22 10:44:59 -05:00
Mahmood Ali
bdef161e20 changelog and comment 2019-11-19 15:51:08 -05:00
Mahmood Ali
6878134a7f always destroy 2019-11-18 21:31:29 -05:00
Mahmood Ali
a15bdc130d Add tests for orphaned processes 2019-11-18 21:31:29 -05:00
Nick Ethier
c36fe98198 driver: set correct network isolation caps for exec and java dr… (#6368) 2019-09-25 11:48:14 -04:00
Nick Ethier
e26192ad49 Driver networking support
Adds support for passing network isolation config into drivers and
implements support in the rawexec driver as a proof of concept
2019-07-31 01:03:20 -04:00
Nick Ethier
da3978b377 plugins/driver: make DriverNetworkManager interface optional 2019-07-31 01:03:19 -04:00
Nick Ethier
4a8a96fa1a ar: initial driver based network management 2019-07-31 01:03:17 -04:00
Mahmood Ali
2cc2e60ded update comment 2019-06-11 13:00:26 -04:00
Mahmood Ali
c72bf13f8a exec: use an independent name=systemd cgroup path
We aim for containers to be part of a new cgroups hierarchy independent
from nomad agent.  However, we've been setting a relative path as
libcontainer `cfg.Cgroups.Path`, which makes libcontainer concatinate
the executor process cgroup with passed cgroup, as set in [1].

By setting an absolute path, we ensure that all cgroups subsystem
(including `name=systemd` get a dedicated one).  This matches behavior
in Nomad 0.8, and behavior of how Docker and OCI sets CgroupsPath[2]

Fixes #5736

[1] d7edf9b2e4/vendor/github.com/opencontainers/runc/libcontainer/cgroups/fs/apply_raw.go (L326-L340)
[2] 238f8eaa31/vendor/github.com/containerd/containerd/oci/spec.go (L229)
2019-06-10 22:00:12 -04:00
Mahmood Ali
82611af925 drivers/exec: Restore 0.8 capabilities
Nomad 0.9 incidentally set effective capabilities that is higher than
what's expected of a `nobody` process, and what's set in 0.8.

This change restores the capabilities to ones used in Nomad 0.9.
2019-05-20 13:11:29 -04:00
Mahmood Ali
494642b11c typo: "atleast" -> "at least" 2019-05-13 10:01:19 -04:00
Mahmood Ali
74e5e20c0b drivers: implement streaming exec for executor based drivers
These simply delegate call to backend executor.
2019-05-10 19:17:14 -04:00